Abstract


Quantifying the similarity between two entities constitutes a particularly interesting and useful operation in several
theoretical and applied problems. Aimed at quantifying the similarity between two sets, the Jaccard index has been
extensively used in the most diverse types of problems, also motivating some respective generalizations. The present
work addresses further generalizations of this index, including its modification into a coincidence index capable of
also accounting for the level of interiority of the sets, an extension for sets in continuous vector spaces, the consideration
of weights associated with the involved set elements, the generalization to multiset addition, densities and generic scalar
fields, as well as a means to quantify the joint interdependence between random variables. The possibility
of taking into account more than two sets is also addressed, including the description of an index capable of quantifying
the level of chaining between three sets. Several of the described and suggested generalizations are illustrated
with numeric case examples. It is also posited that these indices can play an important role in analyzing
and integrating datasets in modeling approaches and pattern recognition activities, including as a measurement of
cluster similarity or separation.


‘Riedificano Ersilia altrove. Tessono con i fili una figura
simile che vorrebbero più complicata e insieme più regolare
dell'altra.’


Italo Calvino, Le Città Invisibili. 


1 Introduction 


Despite its seeming simplicity, set theory underlies a
substantial portion of the mathematical and physical
sciences, while also being extensively used in virtually every
area of human activity. In fact, set theory concepts are so
ubiquitous as to be incorporated into language and daily
conversations. When one says “I will buy bananas and
potatoes and tomatoes,” it is actually the set operation
of union that is meant. Interestingly, the tenuous
border between set theory and propositional logic is often
blurred by humans (see [1]). At the same time, multisets
(e.g. [2, 3, 4]) offer means for extending sets so as to
consider the multiplicity of elements.

Another concept that is ubiquitously employed in
human activity is that of similarity
and distance between two entities. Mathematically, this
can be related to quantifying in an objective manner several
types of similarity between two or more mathematical
structures such as scalars, sets, vectors, matrices, functions,
densities, graphs, etc. This can be done in several
manners, which frequently take into account the respective
type of structure. For instance, vectors are often
compared in terms of their inner product, and several
similarity indices (e.g. [5]) have been suggested for comparing
matrices with binary features.

One approach to the similarity between two sets that
has attracted particular attention as a consequence of its
interesting characteristics, and has therefore been employed
extensively, is the Jaccard or Tanimoto index (e.g. [6, 7]).
In addition to being constrained within the interval [0, 1],
the Jaccard index is also intuitive, relatively simple, and
computationally inexpensive. Besides its vast
range of applications (e.g. [8, 6, 9, 10, 11]), the Jaccard
index has also motivated some extensions and generalizations,
including its adaptation to discrete multisets with
positive multiplicities (e.g. [2, 3, 4]).

Given the popularity of the Jaccard index, as well as its 
appealing characteristics, it would be particularly useful 
if it could be adapted to as many as possible other math- 
ematical structures. The present work aims at developing 
further possible generalizations of the Jaccard index. 

We start by focusing on the limited ability of this
index to reflect the extent to which one set is contained in the
other. A respective adaptation of the Jaccard index
is then proposed to address this limitation, which involves
another measurement between two sets, here called the
interiority index. More specifically, we define the coincidence
index between two sets as corresponding to the square
root of the product of the respective Jaccard and interiority
indices.

We then approach the adaptation of the Jaccard index
to take into account sets corresponding to regions in
continuous spaces such as $\mathbb{R}^2$. It is shown that this can
be immediately accommodated into the standard Jaccard
(and also the coincidence) indices by using the areas of the
regions in place of the sizes of sets. In addition to allowing
useful graphical characterizations of the Jaccard and
coincidence indices, this extension to continuous sets also
paves the way to dealing with densities and scalar fields.
In particular, we develop a related graphical construct to
compare the relationship between the Jaccard, interiority
and coincidence indices.

Another interesting possibility covered in the present
work concerns the characterization of the similarity of sets
whose elements have been assigned weights expressing
their respective relevance. This is achieved by a simple
modification of the Jaccard and coincidence indices.

Next, we address the particularly interesting question
of adapting the Jaccard index to become capable of comparing
densities and functions in continuous spaces $\mathbb{R}^N$,
which correspond to generic scalar fields on those domains.
This is achieved by extending the multiset version
of the Jaccard index to incorporate integrals (functionals)
of the minimum and maximum operations along the
respective space. Because generic fields involve negative
values, the approach is presented first for non-negative
densities, being subsequently extended to more general
fields with negative multiplicities. The potential of the
approach, which is conceptually and computationally simple,
is then illustrated with respect to comparing probability
density functions as well as more generic functions,
namely two sinusoids and two real-world
images.

Another issue of special relevance that is addressed
regards the inherent, but not often considered,
relationship between quantifying the similarity
between two probability densities and the also ample
subject of characterizing the joint variation of two random
variables. We
show that the multiset Jaccard adaptation to densities
and functions can be effectively applied to quantify the
joint relationship between two random variables, be it
in terms of discrete observations or while taking into account
the probability densities describing standardized
versions of the involved variables. The methodology involves
reflecting negative multiplicities across quadrants,
therefore allowing the treatment of joint distributions and
scalar fields with negative values.

The last topic addressed in the present work
concerns the possibility of generalizing the Jaccard index
to deal with more than 2 sets. We argue that there are
two main ways in which this problem can be addressed.
First, it is possible to have either of the two sets involved in
the Jaccard index correspond to generic combinations
of any number of sets, obtained by using set operations.
Alternatively, more than two sets can actually be considered
as arguments of extended Jaccard indices. The
latter possibility is illustrated through the development
of a generalization of the Jaccard index capable
of quantifying the degree of chaining between three sets,
as intermediated by one of them.

The article concludes by discussing the particularly important
role of indices such as those discussed and suggested
here for the ubiquitous activities of model building
and pattern recognition. Some prospects for future developments
are also provided.


2 The Basic Jaccard Index 


The basic Jaccard index can be simply expressed as: 


$$\mathcal{J}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \qquad (1)$$


where A and B are any two sets to be compared. 

It is interesting to keep in mind that, though not
frequently specified, the universe set of A and B can be
conveniently taken as being equal to $\Omega = A \cup B$.

The Jaccard distance can be immediately derived from 
the Jaccard index by making: 


$$D_J(A, B) = 1 - \mathcal{J}(A, B) \qquad (2)$$





This approach can be immediately extended to any
other similarity index bounded between 0 and 1.

It is also interesting to observe that it is possible to 
modify the Jaccard index so as to reflect in absolute terms 
the effective cardinality of the intersection of the two sets. 
This can be done as: 


$$\mathcal{J}_a(A, B) = \frac{|A \cap B|^2}{|A \cup B|} \qquad (3)$$


The Jaccard index can be immediately generalized to
multisets or bags (e.g. [12, 13]), which are basically sets
in which repeated elements are allowed. The multisets
A and B can be represented as respective vectors
$A = [a_1, a_2, \ldots, a_N]$ and $B = [b_1, b_2, \ldots, b_N]$, where N is the
total number of possible distinct elements in the universe
defined by the union of the two multisets, and $a_i$ corresponds
to the multiplicity of element i in the multiset A.


The Jaccard index for multisets then becomes: 


$$\mathcal{J}_M(A, B) = \frac{\sum_{i=1}^{N} \min(a_i, b_i)}{\sum_{i=1}^{N} \max(a_i, b_i)} \qquad (4)$$

with $0 \leq \mathcal{J}_M(A, B) \leq 1$.

As an example, let’s consider A = {a, a, a, b, b} and
B = {a, a, b, c, c, d}. If we have the set of possible elements
organized into the indexing vector p = [a, b, c, d], we will
obtain A = [3, 2, 0, 0] and B = [2, 1, 2, 1]. Observe that
the order of elements in p is immaterial to our analysis.
Then, we have:

$$\mathcal{J}_M(A, B) = \frac{2 + 1 + 0 + 0}{3 + 2 + 2 + 1} = \frac{3}{8} \qquad (5)$$
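The multiset example can be checked numerically. The following is a minimal sketch, assuming the multisets are given as Python lists; the function name `multiset_jaccard` is hypothetical and is not taken from the original work.

```python
from collections import Counter


def multiset_jaccard(a, b):
    """Multiset Jaccard index: sum of minimum multiplicities
    divided by sum of maximum multiplicities."""
    ca, cb = Counter(a), Counter(b)
    support = set(ca) | set(cb)  # all distinct elements of both multisets
    inter = sum(min(ca[e], cb[e]) for e in support)
    union = sum(max(ca[e], cb[e]) for e in support)
    if union == 0:
        raise ValueError("both multisets are empty")
    return inter / union


A = ["a", "a", "a", "b", "b"]
B = ["a", "a", "b", "c", "c", "d"]
jm = multiset_jaccard(A, B)
print(jm)  # 0.375, i.e. 3/8
```

A `Counter` returns 0 for elements it does not contain, which matches the convention of zero multiplicity for absent elements.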


As a consequence, this adaptation of the Jaccard index
allows it to be applied also to vectors, matrices, and
graphs. In the case of matrices, the Jaccard equation can
be further modified as:

$$\mathcal{J}_M(A, B) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \min(a_{ij}, b_{ij})}{\sum_{i=1}^{N} \sum_{j=1}^{N} \max(a_{ij}, b_{ij})} \qquad (6)$$

Observe that many other mathematical structures, such
as matroids, tensors, etc., can be compared by further
adapting the above equation.


3 Interiority and Coincidence Indices


As illustrated in the previous sections, and also by the
relatively extensive related literature, the Jaccard index
provides an intuitive and logical manner to quantify the
similarity between two discrete or continuous sets. Yet,
there is one particular situation, illustrated in Figure 1,
which is not accounted for by this index.




Figure 1: Two distinct situations involving two sets A and B that 
yield the same Jaccard index value of 3/7. However, the two sets 
in (b) are much more compatible because B is a subset of A and 
therefore shares all its elements. 


As can be easily verified, both situations depicted
in Figure 1 lead to the same Jaccard index J = 3/7.
However, the situation in (b) can be deemed to be quite
distinct because, in this case, the set B is completely contained
in A to the point of becoming a subset, i.e. $B \subset A$.
In other words, all elements of B are shared with the set
A. This is not the case in situation (a), for both sets
A and B have elements that are not shared.

It therefore follows that it would be interesting to ob- 
tain a modification of the Jaccard index that could distin- 
guish between these two situations. A possible approach 
is described as follows. 

We start by considering an index capable of quantifying 
how much a set is interior to another. Let A and B be 
any two sets. The henceforth called interiority index can 
be written as: 


$$\mathcal{I}(A, B) = \frac{|A \cap B|}{\min\{|A|, |B|\}} \qquad (7)$$


It can be verified that $0 \leq \mathcal{I}(A, B) \leq 1$. Its minimum
value is observed when A is completely separated from
B, i.e. $A \cap B = \emptyset$. The maximum value is reached when
either of the sets is completely contained in the other. In
other words, there is no need to specify which of the two
sets is being considered as being internal to the other.

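By comparing Equations 1 and 7, and noting that $|A \cup B| \geq \min\{|A|, |B|\}$, it follows that:

$$0 \leq \mathcal{J}(A, B) \leq \mathcal{I}(A, B) \leq 1 \qquad (8)$$

and it can also be verified that:

$$0 \leq \mathcal{J}(A, B)\, \mathcal{I}(A, B) \leq 1 \qquad (9)$$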


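The verification of similarity accounted for by the Jaccard
index can be conveniently combined with the interiority
index simply by considering their respective product,
i.e.:

$$\mathcal{J}(A, B)\, \mathcal{I}(A, B) = \frac{|A \cap B|}{|A \cup B|} \, \frac{|A \cap B|}{\min\{|A|, |B|\}} \qquad (10)$$

which is the same as:

$$\mathcal{J}(A, B)\, \mathcal{I}(A, B) = \frac{|A \cap B|^2}{|A \cup B| \, \min\{|A|, |B|\}} \qquad (11)$$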


It may also be interesting to take the square root of this
product in order to compensate for the heterogeneity
implied when multiplying two measurements in the interval
[0, 1]. Therefore, in the present work we will adopt the
coincidence index as:

$$\mathcal{C}(A, B) = \sqrt{\mathcal{J}(A, B)\, \mathcal{I}(A, B)} \qquad (12)$$

In some specific cases it is also possible to use these
two indices separately, defining a corresponding tuple
$[\mathcal{J}(A, B), \mathcal{I}(A, B)]$.

The consideration of continuous sets in the next section
provides an interesting resource for comparing
several indices.
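The difference between the Jaccard, interiority and coincidence indices can be illustrated with a minimal sketch on ordinary Python sets. The function names and the example sets below are hypothetical: they are chosen only so that both cases yield J = 3/7, as in Figure 1, and are not the sets of the original figure. Non-empty sets are assumed.

```python
from math import sqrt


def jaccard(a, b):
    """Basic Jaccard index |A ∩ B| / |A ∪ B| (assumes a non-empty union)."""
    return len(a & b) / len(a | b)


def interiority(a, b):
    """Interiority index |A ∩ B| / min(|A|, |B|) (assumes non-empty sets)."""
    return len(a & b) / min(len(a), len(b))


def coincidence(a, b):
    """Coincidence index: square root of Jaccard times interiority."""
    return sqrt(jaccard(a, b) * interiority(a, b))


# Case (a): partial overlap, 3 shared elements out of 7 in the union.
A1, B1 = {1, 2, 3, 4, 5}, {3, 4, 5, 6, 7}
# Case (b): B2 is a subset of A2, again 3 shared out of 7.
A2, B2 = {1, 2, 3, 4, 5, 6, 7}, {1, 2, 3}

for a, b in ((A1, B1), (A2, B2)):
    print(jaccard(a, b), interiority(a, b), coincidence(a, b))
```

Both pairs share the same Jaccard value, but the interiority is 3/5 in the first case and 1 in the second, so the coincidence index is larger when one set is contained in the other.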


4 Continuous Sets 


The approach described henceforth holds for $\mathbb{R}^N$, but
we shall consider the plane vector space $\mathbb{R}^2$. It
is possible to associate sets to the points (x, y) of
this space in any possible manner, such as R =
{(x, y) | x and y are even}, which is a discontinuous set in
$\mathbb{R}^2$, or S = {(x, y) | 0 < x < 2, -1 < y < 1}, which defines
a continuous region.

Though the Jaccard index can be immediately applied
to any of these sets, it is of particular interest to our
developments to consider set configurations corresponding
to simple connected regions of $\mathbb{R}^2$ such as those illustrated
in Figure 2.


Figure 2: The three most relevant situations to be considered when
comparing two sets: (a) no intersection; (b) partial intersection;
(c) complete intersection.


In this case, the size of the sets can be conveniently
substituted by the respective areas, indicated as |A|, |B|,
$|A \cap B|$, and $|A \cup B|$, which can be immediately used in
Equation 1.

The three cases in Figure 2 also correspond to the
most representative situations when comparing two sets.
In Figure 2(a), we have two separated sets, which results
in null intersection, suggesting minimal similarity between
the two sets. The situation depicted in (c) can be understood
as leading to the maximum similarity that can be
achieved with the sets A and B. Figure 2(b) illustrates a
frequently found situation in which there is some intersection
between the sets. In this case, it would be expected
that the similarity increases with the intersection area.





The situation represented in Figure 2(b) actually incor- 
porates the two other situations as limit cases. Consider 
the diagram shown in Figure 3, involving two square re- 
gions A and B, with respective sides a and b, b < a. 


The relative position, and also the similarity, of the two
sets can be completely controlled in terms of the relative
position parameter x, with $(a - b)/2 \leq x \leq (a + b)/2$. As x
increases, the two squares progressively separate, therefore
becoming less similar.

There is only one other parameter that needs to be
specified in order to completely represent the situation
in Figure 3, namely the relative size of the two regions,
r = b/a, with $0 < r \leq 1$.

















Figure 3: A construction representing all possible situations regard- 
ing the similarity of two sliding squares A and B with sides a and 
b, respectively. Without loss of generality, we assume that a > b. 
Any of these situations can be specified by just two parameters: the 
relative position x and the relative size r = b/a. This construction 
allows us to better understand the behavior of the Jaccard and other 
similarity indices covered in this work. 


The area of the intersection $A \cap B$ and union $A \cup B$ of
the two sets can now be conveniently expressed in terms
of x and b = ra as:

$$|A \cap B| = (ra)\left(\frac{a}{2} - \left(x - \frac{ra}{2}\right)\right) = \frac{1}{2}\left(a^2 r (1 + r) - 2 r a x\right) \qquad (13)$$

$$|A \cup B| = \frac{1}{2}\left(2 a^2 (1 + r^2) - a^2 r (1 + r) + 2 r a x\right) \qquad (14)$$

We can now rewrite the Jaccard index as:

$$\mathcal{J}(A, B) = \frac{a^2 r (1 + r) - 2 r a x}{2 a^2 (1 + r^2) - a^2 r (1 + r) + 2 r a x} \qquad (15)$$

Figures 4(a) to (c) present the Jaccard index for two
rectangles, as developed above, in terms of several configurations
of the parameters x and r.
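The sliding-squares construction can be explored numerically. The sketch below is an illustration under stated assumptions: the function name `sliding_squares` is hypothetical, the intersection area is taken as (1/2)(a²r(1+r) − 2rax) and the union as |A| + |B| − |A ∩ B|, and x is assumed to range from (a − b)/2 (B fully inside A) to (a + b)/2 (B fully outside A), which is the interval over which that intersection expression goes from b² to 0.

```python
def sliding_squares(a, r, x):
    """Return (jaccard, interiority) for two squares with sides a and b = r*a.

    Assumes 0 < r <= 1 and (a - b)/2 <= x <= (a + b)/2, i.e. the range in
    which the intersection area varies linearly with x.
    """
    b = r * a
    lo, hi = (a - b) / 2, (a + b) / 2
    if not (0 < r <= 1) or not (lo <= x <= hi):
        raise ValueError("parameters outside the assumed range")
    inter = 0.5 * (a * a * r * (1 + r) - 2 * r * a * x)  # |A ∩ B|
    union = a * a + b * b - inter                         # |A| + |B| - |A ∩ B|
    jac = inter / union
    inte = inter / (b * b)  # min(|A|, |B|) = b^2 because b <= a
    return jac, inte


# B fully inside A, half way out, and fully outside A (a = 1, r = 0.5).
for x in (0.25, 0.5, 0.75):
    print(x, sliding_squares(1.0, 0.5, x))
```

At the left end of the range the Jaccard index reduces to r² and the interiority to 1; at the right end both vanish. The interiority decreases linearly with x, while the Jaccard index does not, since x also appears in its denominator.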


5 Addition-Based Multiset Jaccard Index


The multiset Jaccard index can be further generalized by 
taking into account the sum of the two sets A and B 
instead of their respective union, which leads to: 


$$\mathcal{J}_S(A, B) = \frac{2 \sum_{i=1}^{N} \min(a_i, b_i)}{\sum_{i=1}^{N} (a_i + b_i)} \qquad (16)$$

with $0 \leq \mathcal{J}_S(A, B) \leq 1$.

The interesting feature of this index is that it covers
situations in which all instances of each element in the
multisets need to be fully taken into account when
combining the sets.

As an example, let’s consider that A = {a,a,a, b,c, c, c} 
and B = {a,a,c}. Then, we have that: 
Figure 4: The Jaccard (a), interiority (b), and square root of the coincidence (c) indices obtained for the geometrical construction illustrated
in Figure 3. The heat map increases from yellow to brown. The incorporation of the interiority level into the Jaccard index leads to a more
commensurate distribution of the level sets. The maximum values of the Jaccard and coincidence indices are to be found at the lower righthand
corner of the respective plots. The values of the interiority index increase linearly from the top to the bottom diagonal in (b). Along (d)
to (f), we have analogous results concerning the additive multiset Jaccard index (d), the interiority (e), and the square root of the additive
coincidence index (f). The latter index corresponds to the product of the additive multiset Jaccard and
interiority indices.


and:

$$\mathcal{J}_S(A, B) = \frac{(2)(3)}{7 + 3} = \frac{3}{5} \qquad (18)$$
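The addition-based index for this example can be verified with a short sketch. The function name `additive_multiset_jaccard` is hypothetical; the multisets are assumed to be given as Python lists and at least one of them is assumed to be non-empty.

```python
from collections import Counter


def additive_multiset_jaccard(a, b):
    """Addition-based multiset Jaccard: twice the sum of minimum
    multiplicities divided by the total number of instances."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())  # Counter '&' keeps the minimum counts
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        raise ValueError("both multisets are empty")
    return 2 * inter / total


A = list("aaabccc")  # {a, a, a, b, c, c, c}
B = list("aac")      # {a, a, c}
js = additive_multiset_jaccard(A, B)
print(js)  # 0.6, i.e. (2)(3)/(7+3)
```

The minimum multiplicities are 2 for a, 0 for b and 1 for c, giving 3, while the two multisets hold 7 and 3 instances respectively.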

Figures 4(d) to (f) depict the results obtained for the
additive multiset Jaccard index considering the same construction
as described in Figure 3. The additive multiset
Jaccard index can be immediately combined with the respective
interiority index to yield the additive multiset
coincidence index.

The geometrical construct in Figure 3 offers an interesting
approach to comparing the varying results obtained
in Figure 4. The rationale is as follows: as the square
B slides from being completely inside the square A until
it becomes completely outside the latter, it is reasonable
to expect the similarity to decrease in a linear manner
with x. This suggests that we can compare the several
indices in Figure 4 by taking vertical slices of the
respective scalar fields. For simplicity’s sake,
we will consider five slices corresponding to
b = 10, 20, 30, 40, 50. The results are shown in Figure 5.


Among the five indices in Figure 4, only the additive
multiset Jaccard index accounts for linear similarity quantification
as x also varies in a linear manner. The interiority,
as expected, is unable to consider the relative size of the
sets. The coincidence indices penalize the similarity for
small values of x (i.e. when the slices are further away),
with the basic coincidence index being more strict. Also,
the basic Jaccard tends to penalize these cases more intensely
than the additive multiset Jaccard index.

None of these indices are absolutely better than the 
others. It is the specific requirements of each application 
that should lead to a suitable choice while considering the 
above identified properties of each index. 


6 Weighted Discrete Elements 


The Jaccard and coincidence indices can be readily 
adapted to cope with cases in which the elements of sets A 
and B have been assigned respective weights correspond- 
ing to their relative importance in each specific problem. 
Figure 5: Five vertical slices of the respective indices in Fig. 4. Only the interiority and additive multiset Jaccard indices account for the
expected linear decrease of similarity following x displacements. However, the interiority is unable to take into account the relative size of
the squares, which leaves only the additive multiset Jaccard as presenting a complete and linear quantification of the similarity between the
two sets. The square roots of the coincidence indices both penalize the similarity when x is small, with the basic coincidence index imposing
the most severe test of similarity.


This situation can be approached by using ordered pairs
to represent each of the elements in $A \cup B$ associated with
its respective weight, i.e. $[x_i, w(x_i)]$. The Jaccard index
then becomes:

$$\mathcal{J}_W(A, B) = \frac{\sum_{x_i \in (A \cap B)} w(x_i)}{\sum_{y_i \in (A \cup B)} w(y_i)} \qquad (19)$$

with $0 \leq \mathcal{J}_W(A, B) \leq 1$.

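As an example, let A = {[a, 2]; [b, 5]; [c, 1]} and B =
{[b, 5]; [e, 1]; [f, 1]}. It follows that $A \cap B$ = {[b, 5]}, while the
weights in $A \cup B$ add up to 2 + 5 + 1 + 1 + 1 = 10, so that:

$$\mathcal{J}_W(A, B) = \frac{5}{10} = \frac{1}{2} \qquad (20)$$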

Thus, although in this particular example the intersection
is limited to a single one of the possible elements, the
Jaccard index is relatively high as a consequence of
the large weight associated with the element b.
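The weighted example can be reproduced with a minimal sketch in which each set is represented as a dictionary mapping elements to weights. This representation and the function name `weighted_jaccard` are assumptions made for illustration; it is also assumed that an element shared by both sets carries the same weight in each.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard index: total weight of shared elements divided
    by total weight of all elements in the union.

    a, b: dicts mapping element -> weight (shared elements are assumed
    to carry the same weight in both dicts).
    """
    weights = {**a, **b}          # one weight per element of the union
    shared = a.keys() & b.keys()  # elements present in both sets
    inter = sum(weights[e] for e in shared)
    union = sum(weights.values())
    if union == 0:
        raise ValueError("total weight of the union is zero")
    return inter / union


A = {"a": 2, "b": 5, "c": 1}
B = {"b": 5, "e": 1, "f": 1}
jw = weighted_jaccard(A, B)
print(jw)  # 0.5, i.e. 5/10
```

With unit weights the function reduces to the basic Jaccard index, which would give 1/5 for the same two sets.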

Observe that the weighted version of the Jaccard index
is not the same as the Jaccard index adapted to multisets,
as the latter case does not involve the sum of weights.
However, it is possible to consider weighted multisets,
in which case the Jaccard index becomes:

$$\mathcal{J}(A, B) = \frac{\sum_i w_i \min\{a_i, b_i\}}{\sum_i w_i \max\{a_i, b_i\}} \qquad (21)$$


7 Continuous Densities and Scalar Fields


The developments discussed in the previous sections pave
the way for also considering sets corresponding to densities,
such as probability density functions, as well as
completely generic functions and scalar fields. One of the
main problems to be overcome here is that densities often
have infinite support, meaning that they extend over
infinite ranges in their respective space.

The problem of comparing two distributions is partic- 
ularly important in many theoretical and applied areas, 
having motivated great interest and the proposal of sev- 
eral respective approaches (e.g. [14, 15]). 

One interesting perspective that can be used to adapt 
the Jaccard and coincidence indices so as to allow com- 
parison of densities is developed as follows. 

We start by representing a generic continuous function
in terms of a respective discretization, with resolution $\Delta x$,
as illustrated in Figure 6.

Figure 6: A generic density function p(x) being discretized with
resolution $\Delta x$, so that it can be represented as a vector $\vec{p} = [x_i]\, \Delta x$.
The integrals of the original function and of its discretization are
assumed to be 1, so that they are both normalized as densities.

The density p(x) becomes the vector $\vec{p} = [x_i]\, \Delta x$. Now,
in a vector the order of the elements is all important, but
it is also possible to relax this constraint and represent
the discretized function as the respective multiset:


$$A = \{[x_1, m(x_1)]; \ldots; [x_i, m(x_i)]; \ldots; [x_n, m(x_n)]\} \qquad (22)$$


where $m(x_i)$ is the multiplicity of the element $x_i$,
generalized to take real values. In addition, we have also
assumed, for simplicity’s sake, that the discretization takes
place on n points, which are henceforth understood as the
support of both the function and the multiset. It is proposed
here that functions transformed into their respective
multisets can be called multifunctions, or mfunctions.

Now, let $m_A(x_i)$ and $m_B(x_i)$ correspond to the
multiplicity of the elements in the two sets obtained by
discretization of two density functions $p_A(x)$ and $p_B(x)$,
assuming the same support. Though the order of the
coordinates is not considered in multisets, the respective
multiset Jaccard can nevertheless be obtained as:

$$\mathcal{J}_M(A, B) = \frac{\sum_{i=1}^{n} \min(m_A(x_i), m_B(x_i))}{\sum_{i=1}^{n} \max(m_A(x_i), m_B(x_i))} \qquad (23)$$



This index is then capable of expressing the similarity
between the two original densities up to the x-resolution
$\Delta x$. The above reasoning extends immediately to discrete
probability densities, and the Dirac delta approach can be
applied conveniently here.

By making $\Delta x \rightarrow 0$, we then obtain:

$$\mathcal{J}_M(A, B) = \frac{\int_{\omega} \min(m_A(x), m_B(x))\, dx}{\int_{\omega} \max(m_A(x), m_B(x))\, dx} \qquad (24)$$

where $\omega$ is the common support of the two multisets A
and B.

As such, the Jaccard index can be understood to correspond
to a functional derived from the two functions or,
perhaps more specifically, an mfunctional.

The above result extends immediately to density functions
on higher dimensional domains as:

$$\mathcal{J}_M(A, B) = \frac{\int_{\omega} \min(m_A(\vec{x}), m_B(\vec{x}))\, d\vec{x}}{\int_{\omega} \max(m_A(\vec{x}), m_B(\vec{x}))\, d\vec{x}} \qquad (25)$$

which provides a means to apply the multiset Jaccard
index to continuous or discrete density functions for any
number of random variables.
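The integral form of the multiset Jaccard index can be approximated by the discretization described above. The following is a minimal one-dimensional sketch, assuming a midpoint rule on a finite interval that truncates the infinite support; the function names, the interval and the two normal densities are hypothetical choices for illustration and are not the densities of Figure 7.

```python
import math


def gaussian(mu, sigma):
    """Return a normal probability density function with the given parameters."""
    norm = sigma * math.sqrt(2 * math.pi)
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / norm


def density_jaccard(p, q, lo, hi, n=2000):
    """Approximate the ratio of integral(min(p, q)) to integral(max(p, q))
    over [lo, hi] with a midpoint rule using n cells of width dx."""
    dx = (hi - lo) / n
    inter = union = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx  # midpoint of the k-th cell
        a, b = p(x), q(x)
        inter += min(a, b) * dx
        union += max(a, b) * dx
    return inter / union


p = gaussian(0.0, 1.0)
q = gaussian(1.0, 1.0)
j_pq = density_jaccard(p, q, -8.0, 9.0)
j_pp = density_jaccard(p, p, -8.0, 9.0)
print(j_pq, j_pp)
```

The interval has to be wide enough for the neglected tails to be negligible. Identical densities give an index of 1, and the index decreases as the densities move apart.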


As an example, let’s consider the two density functions
$X_A$ and $X_B$ depicted in Figure 7(a).


Figure 7: Two probability density functions p(x) and q(x) (a), with
respective intersection and union as shown in (b) and (c). This
situation yields a Jaccard index equal to 0.41. The maximum value
1 is obtained whenever the two densities are identical.


The respective intersection $X_A \cap X_B$ and union $X_A \cup X_B$
of these two densities, obtained by using the minimum
and maximum operations between each pair of
values, are presented in Figures 7(b) and (c), respectively.
The Jaccard index, obtained by dividing the area of the
intersection curve by the area of the union, yielded a
value of 0.09257.

The Jaccard index can also be adapted to quantify the
separation of two groups of points, or clusters. The basic
idea here would be to represent each of the clusters in
terms of the joint probability density and then apply the
Jaccard index over them by considering the densities as
the respective multiplicity of every element. This method
can be applied to any number of involved features.

Though we have so far considered both $X_A(\vec{x})$ and
$X_B(\vec{x})$ to correspond to non-negative scalar fields with
hypervolume 1, it is actually possible to employ the Jaccard
and coincidence indices to quantify the similarity
between any two scalar mfunctions or mfields $\phi_A(\vec{x})$ and
$\phi_B(\vec{x})$ sharing the same domain, even in the presence of
negative multiplicities.

A possible manner to adapt the Jaccard index to negative
multiplicities in a two-dimensional space is as follows.
In case the pair of points $[m(X_A), m(X_B)]$ is in the first
quadrant, the minimum and maximum between the two
multiplicities are accumulated into the intersection and
union integrals, respectively. If the point is in the third
quadrant, both coordinates have their signs inverted and
the respective minimum and maximum are accumulated.

Otherwise, if the point belongs to the II or IV quadrant,
the point $[m(X_A), m(X_B)]$ is mirrored into the opposite
quadrant with respect to the vertical axis, and it is the
negative of the minimum between the multiplicities that
is then accumulated into the intersection integral (to
compensate for the reflection), while the union is taken with
positive sign. The resulting accumulated intersection
and union are then used in Equation 25 to obtain the
respective Jaccard index.
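The quadrant-based procedure can be sketched for two sampled functions as follows. This is one possible reading of the description, stated as an assumption: after the sign inversions and reflections, every pair contributes the maximum of the absolute multiplicities to the union, and the minimum of the absolute multiplicities to the intersection with positive sign in quadrants I and III and negative sign in quadrants II and IV. The function name `signed_jaccard` and the sampling grid are hypothetical.

```python
import math


def signed_jaccard(f_vals, g_vals):
    """Multiset Jaccard index for paired samples that may be negative.

    Pairs with equal signs (quadrants I and III) add min(|a|, |b|) to the
    intersection; pairs with opposite signs (quadrants II and IV) subtract
    it. The union always accumulates max(|a|, |b|).
    """
    inter = union = 0.0
    for a, b in zip(f_vals, g_vals):
        small, large = sorted((abs(a), abs(b)))
        if a * b >= 0:
            inter += small
        else:
            inter -= small
        union += large
    if union == 0:
        raise ValueError("both functions are identically zero")
    return inter / union


n = 1000
thetas = [2 * math.pi * k / n for k in range(n)]  # one full period
cos_vals = [math.cos(t) for t in thetas]
sin_vals = [math.sin(t) for t in thetas]
neg_cos_vals = [-c for c in cos_vals]

print(signed_jaccard(cos_vals, sin_vals))      # close to 0
print(signed_jaccard(cos_vals, cos_vals))      # 1.0
print(signed_jaccard(cos_vals, neg_cos_vals))  # -1.0
```

Under this reading the index lies in [-1, 1], in analogy with a correlation coefficient: identical functions give 1, functions of opposite sign give -1, and a cosine against a sine over a full period gives approximately 0.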







Figure 8: Jaccard for mfunctions with negative multiplicity. Points 
in the II and IV quadrants are reflected with respect to the vertical 
axes, and their intersection (minimum values) and union (maximum 
values) enter with negative values in the accumulated intersection 
and union. 


For instance, let’s calculate the multiset Jaccard index
for the functions $f(x) = \cos(\theta)$ and $g(x) = \sin(\theta)$ for a
complete period $0 \leq \theta \leq 2\pi$, as illustrated in Figure 9.


Figure 9: The Jaccard index calculated for a cosine and a sine
function. The mfunctions are shown in (a), and the respective
intersection and union mfunctions are shown in (b). The obtained
Jaccard index was equal to 0. Indices of 1 and -1 will be obtained
in case g(x) = cos(t) and g(x) = -cos(t), respectively.

The intersection of two mfunctions with negative values
involves a binary operator analogous to the inner product
in function spaces, in which the multiplicities are reflected
among the four quadrants depending on their signs [4].

A further example of the Jaccard index adapted to mul- 
tidimensional scalar fields, namely a gray level image, also 
incorporating the respective scatterplot representation of 
the paired multiplicities is provided in Figure 10. 


8 Joint Variations 


Joint variations are often taken in a normalized manner,
as when using the Pearson correlation coefficient. More
specifically, this coefficient can be understood
as corresponding to the covariance provided the samples
of the two sets have been first standardized. By standardization
it is henceforth understood that, given a random
variable X, we apply the following random variable
transformation:

$$\tilde{X} = \frac{X - \mu_X}{\sigma_X} \qquad (26)$$


This standardization has the effect of normalizing the
dispersion of a random variable, so that its variance
becomes 1 while the average is 0. It can also be verified
that a standardized random variable will present most of
its observations within the interval [-2, 2].

Figure 10: A gray level image of flowers img[x, y] (a) was mixed with
random noise uniformly distributed between -0.5 and 0.5, resulting
in the noisy image $img[x, y] + \epsilon[x, y]$ shown in (b). The resulting
scatterplot is depicted in (c), including the identity line defining the
two regions for calculation of the scalar field intersection and union,
from which a respective Jaccard index of $\mathcal{J}(img, img + \epsilon) = 0.83$
was obtained, reflecting a relatively high similarity between the two
scalar fields.

In the case of a set of N observations of two standardized
random variables, the Pearson correlation coefficient
becomes:

$$P(X, Y) = \frac{1}{N} \sum_{i=1}^{N} \tilde{X}_i \tilde{Y}_i \qquad (27)$$
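The standardization and the resulting Pearson coefficient can be sketched as follows. The normalization by N (rather than N - 1) in both the standard deviation and the average of products is an assumption made here so that the two steps are mutually consistent; the function names and data are hypothetical.

```python
import math


def standardize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    if sigma == 0:
        raise ValueError("constant data cannot be standardized")
    return [(x - mu) / sigma for x in xs]


def pearson(xs, ys):
    """Pearson correlation as the average product of standardized values."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + 1 for x in xs]  # perfectly linear in xs
print(pearson(xs, ys))        # close to 1
print(pearson(xs, ys[::-1]))  # close to -1
```

After standardization each variable has zero mean and unit variance, so the average of the pairwise products is bounded between -1 and 1.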


When two standardized random variables X and Y are
taken jointly, they define a scatterplot providing a useful
illustration of the interrelationship between the two
considered variables. This scatterplot can be immediately
understood as corresponding to a sampling of the joint
probability density of the two random variables, which
may be kernel expanded to obtain an estimation of the
respective counterpart.

An interesting question is whether it is
possible to obtain an alternative joint variation
quantification based on the Jaccard or coincidence indices.

Since joint distributions of points in scatterplots often 
incorporate negative values of the variables, it becomes 
necessary to employ the procedure described in Section 7 
in order to cope with the negative values. 

In order to illustrate the possibility of quantifying the joint
variation of observations in a scatterplot (or, actually,
joint densities), we consider the situation in Figure 11,
which shows several scatterplots drawn from normal densities
with increasing correlation.

It is also interesting to observe that the comparison
of two densities can be shown as a scatterplot, with the
two density functions defining a parametric curve. This
is illustrated in Figure 12.


9 Multiple Sets 


We have so far considered indices applied to two sets or
entities. There are two basic ways in which more sets
can be taken into account. The first one is simply to
understand that each of the two sets A and B is obtained
by set operation combinations among several other sets.
For instance, we may have $A = (C \cap D) \cup E - F$ and
$B = C \cup G$. We may write:

$$A = f(C, D, E, F), \qquad B = g(C, G)$$


Observe that there is absolutely no restriction on these
functions, except that they are not both empty sets.
The Jaccard index for the example above can be expressed as:

$$\mathcal{J}(A(C, D, E, F), B(C, G)) = \frac{|A(C, D, E, F) \cap B(C, G)|}{|A(C, D, E, F) \cup B(C, G)|}$$

Therefore, a vast number of possible combinations of diverse
sets become possible, but they will ultimately always
lead to two resulting sets A and B to be compared by the
Jaccard or coincidence indices.

Figure 11: Comparison of the Pearson correlation coefficient and the multiset Jaccard index for negative multiplicities with respect to
several distributions of points with increasing correlation. Interestingly, the Jaccard index seems to provide a more gradual quantification of
the joint variations that is probably more compatible with our perception. At the same time, the Pearson correlation coefficients tend to
saturate as the correlation increases.

There is another interesting possibility to take into account
more than 2 sets, and this corresponds to extending
the Jaccard index, for instance in the case involving 3 sets,
as:

$$\mathcal{J}_3(A, B, C) = \frac{|A \cap B \cap C|}{|A \cup B \cup C|}$$

with $0 \leq \mathcal{J}_3(A, B, C) \leq 1$. This concept can be immediately
extended to any number $N_S$ of sets.

The extension of the interiority index becomes:

$$\mathcal{I}_{3,1}(A, B, C) = \frac{|A \cap B \cap C|}{\min\{|A|, |B|, |C|\}}$$

It can be verified that this extended interiority index
now quantifies how much the smallest of the sets is contained
in the overall intersection. However, it does not
take into account how the intermediate size set relates
to the mutual intersection. This can be accomplished by
introducing a second interiority index as:

$$\mathcal{I}_{3,2}(A, B, C) = \frac{|A \cap B \cap C|}{\min\left\{\{|A|, |B|, |C|\} - \min\{|A|, |B|, |C|\}\right\}}$$

The two obtained interiority indices can then be combined
into a single respective index as:

$$\mathcal{I}_3(A, B, C) = \mathcal{I}_{3,1}(A, B, C)\, \mathcal{I}_{3,2}(A, B, C)$$
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Figure 12: The two probability densities p(x) and q(x) in Figure 7 
shown as a parametric curve in the respective scatterplot. In case 








of discrete densities, they can be represented in terms of parametric 
curves related to the joint observations. Continuous densities can 
be represented in a similar manner. 





It is also possible to assign 
weights to the mass distributions so obtained in the scatter plot, 
which may reflect the relative important to each specific problem or 
the repetition of observations. The identity line, shown in salmon, 
partitions the scatterplot space into the two regions U and D. 


with 0 < 73(A, B,C) < 1. 
We can now define the coincidence index extended to 





three sets as: 
C3(A, B,C) = 73(A, B,C) B(A, B,C) 


A similar development applies to more than 3 sets. 
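
The two ways of handling more than two sets described in this
section can be sketched in Python as follows. This is only an
illustration under stated assumptions: the set difference in the
composite example is applied after the union, the second interiority
index is read as dividing by the second-smallest set size, and all
function names and example values are hypothetical.

```python
def jaccard(*sets):
    """Jaccard index |intersection| / |union| for two or more sets."""
    union = set().union(*sets)
    if not union:
        raise ValueError("at least one set must be non-empty")
    return len(set.intersection(*map(set, sets))) / len(union)


def interiority3(a, b, c):
    """Product of the two three-set interiority indices.

    Assumption: the denominator of the second index is the
    second-smallest of the three set sizes.
    """
    common = len(a & b & c)
    sizes = sorted((len(a), len(b), len(c)))
    if sizes[0] == 0:
        raise ValueError("all three sets must be non-empty")
    return (common / sizes[0]) * (common / sizes[1])


def coincidence3(a, b, c):
    """Three-set coincidence index: interiority times Jaccard."""
    return interiority3(a, b, c) * jaccard(a, b, c)


# First approach: A and B are themselves built from other sets.
C, D, E, F, G = {1, 2, 3, 4}, {3, 4, 5}, {6}, {4}, {6, 7}
A = ((C & D) | E) - F          # {3, 6}
B = C | G                      # {1, 2, 3, 4, 6, 7}
j_composite = jaccard(A, B)    # 2 / 6

# Second approach: three sets passed directly as arguments.
X, Y, Z = {1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6, 7}
j3 = jaccard(X, Y, Z)          # 2 / 7
c3 = coincidence3(X, Y, Z)     # (2/4) * (2/4) * (2/7) = 1 / 14
```

Because `jaccard` accepts any number of arguments, the same function
covers the extension to more than three sets.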

The consideration of more than 2 sets in similarity 
indices suggests other possible extensions of the Jaccard 
and coincidence indices. For instance, it becomes 
interesting not only to quantify the overall similarity 
between 3 sets, but also to develop indices capable of 
reflecting how these three sets are connected to one 
another. Consider the situation depicted in Figure 13. 


Figure 13: Three sets A, B and C characterized by sequential, or 
chained, intersections. In the suggested approach, B is taken as a 
candidate reference for intermediating the other two sets through a 
chaining relationship. 


This situation suggests that set B intermediates the 
connection between the sets A and C, therefore 
establishing a chaining relationship. The Jaccard index 
with 2 sets cannot cope directly with this situation. 

A possible index involving three sets that can quantify 
the chaining between them is: 

$$X(A, B, C) = \mathcal{J}(B, (A \cap B) \cup (B \cap C)) \, [1 - \mathcal{J}(A, C)]$$

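As an example, let's consider: 

$$A = \{a, b, c, d, e, f, g\}$$
$$B = \{e, f, g, h, i, j, k\}$$
$$C = \{i, j, k, l, m, n, o\}$$

It follows that: 

$$A \cap B = \{e, f, g\}$$
$$A \cap C = \{\}$$
$$B \cap C = \{i, j, k\}$$
$$A \cup C = \{a, b, c, d, e, f, g, i, j, k, l, m, n, o\}$$
$$(A \cap B) \cup (B \cap C) = \{e, f, g, i, j, k\}$$
$$B \cap [(A \cap B) \cup (B \cap C)] = \{e, f, g, i, j, k\}$$
$$B \cup [(A \cap B) \cup (B \cap C)] = \{e, f, g, h, i, j, k\}$$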


So, we have that: 

$$\mathcal{J}(B, (A \cap B) \cup (B \cap C)) = \frac{|B \cap [(A \cap B) \cup (B \cap C)]|}{|B \cup [(A \cap B) \cup (B \cap C)]|} = \frac{6}{7} \qquad (28)$$

and: 

$$\mathcal{J}(A, C) = \frac{|A \cap C|}{|A \cup C|} = \frac{0}{14} = 0 \qquad (29)$$

From which we obtain the chaining index value of: 

$$X(A, B, C) = \mathcal{J}(B, (A \cap B) \cup (B \cap C)) \, [1 - \mathcal{J}(A, C)] = \frac{6}{7} \, [1 - 0] = \frac{6}{7}$$

which provides an interesting indication of the chaining 
between the sets A, B, and C. Observe that the approach 
described above assumes that set B has been adopted 
as a reference for implementing the chaining between A 
and C. More generic situations can be addressed by 
considering successive pairwise combinations. 
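
The worked example can be reproduced with the short Python sketch
below, which uses the three example sets with B as the reference set.
The function names are illustrative, and returning 0 for the Jaccard
index of two empty sets is a convention assumed here, not one stated
in the text.

```python
def jaccard(a, b):
    """Two-set Jaccard index; assumed to be 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def chaining(a, b, c):
    """Chaining index of a and c through the reference set b."""
    bridge = (a & b) | (b & c)   # part of b shared with a or with c
    return jaccard(b, bridge) * (1.0 - jaccard(a, c))


A = set("abcdefg")
B = set("efghijk")
C = set("ijklmno")

x_abc = chaining(A, B, C)   # (6 / 7) * (1 - 0)
```

Swapping the roles of the arguments, for example `chaining(B, A, C)`,
gives the index with a different set taken as the reference.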

It should be observed that one of the intersections 
between B and A or C may be large enough to bias the 
above index. In these situations, it is possible 
to incorporate an additional index specifying a minimum 
overlap between both A and B as well as B and C. 

Several other analogous chaining indices involving 3 or 
more sets or other structures are possible, leading to com- 
plementary properties. 


10 The Jaccard and Coincidence Indices and Modeling 


By allowing several types of mathematical structures to 
have their relationships quantified in terms of respective 
indices, it becomes possible to objectively and 
quantitatively address a wide range of theoretical and 
practical problems, while also catering for the 
consideration of stochasticity. 

In addition, the several indices discussed and suggested 
in this work represent a valuable resource while developing 
models (e.g. [16]) through the combination of datasets as 
described in [1]. 

There are several possibilities for applying these 
indices. For instance, a new dataset can be compared 
to those already modeled by using the similarity indices. 
Also of particular interest is to identify which 
combinations, through set operations, of the existing 
datasets associated with models are more likely to account 
for other datasets of interest, therefore providing insights 
about how respective models can be identified, related, or 
developed. 
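
As a hedged illustration of the first possibility, the Python sketch
below ranks previously modeled datasets by their coincidence index
with a new dataset. It assumes that the two-set interiority index is
the intersection size divided by the size of the smaller set, and
that the coincidence index is its product with the Jaccard index.
The dataset names and contents are hypothetical.

```python
def jaccard(a, b):
    """Two-set Jaccard index; assumed to be 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def interiority(a, b):
    """Assumed two-set interiority: |a & b| / min(|a|, |b|)."""
    smallest = min(len(a), len(b))
    return len(a & b) / smallest if smallest else 0.0


def coincidence(a, b):
    """Coincidence index taken as interiority times Jaccard."""
    return interiority(a, b) * jaccard(a, b)


# Hypothetical datasets already associated with models.
modeled = {
    "model_1": {1, 2, 3, 4},
    "model_2": {3, 4, 5, 6, 7, 8},
    "model_3": {10, 11},
}
new_dataset = {3, 4, 5, 6}

# Most similar modeled dataset first.
ranking = sorted(
    modeled,
    key=lambda name: coincidence(new_dataset, modeled[name]),
    reverse=True,
)
```

The same ranking could be applied to datasets obtained by set
operation combinations of the existing ones.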

The discussed indices are also interesting from the 
perspective of characterizing, developing, validating and 
applying pattern recognition and deep learning 
approaches [17, 18, 19]. 


11 Concluding Remarks 


Relationships between the several important mathemat- 
ical structures — including sets, functions, vectors, den- 
sities, and graphs — are critically important in virtually 
all areas where mathematics is employed. Given its in- 
teresting features, the Jaccard index has been extensively 
employed in a large range of scientific and technological 
situations. Also as a consequence of its potential, the Jac- 
card index has been generalized in a variety of manners. 

The present work aimed at further generalizing the 
Jaccard index. One of the first discussed possibilities 
consisted in using the interiority index, capable of 
quantifying how much a set is contained in another, as a 
means to compensate for an identified limitation of the 
Jaccard index in taking into account the interiority of 
one set in the other. This index was then combined with 
the Jaccard index to yield the coincidence index, which is 
believed to provide a stricter quantification of the 
similarity between sets. The possibility to adopt the sum 
of multisets instead of the union was also addressed, with 
promising results for the situations where the 
multiplicity of the elements has to be fully taken into 
account. 

The possibility to apply the Jaccard and coincidence 
indices to continuous sets was then addressed by 
considering the areas of the involved regions in place of 
the number of elements in the involved sets. This 
adaptation of the Jaccard index allowed the consideration 
of density fields and functions, which was approached by 
using the Jaccard index for multisets. The potential of 
this generalization of the Jaccard index was then briefly 
illustrated with respect to probability density functions, 
a comparison between the cosine and sine functions, which 
are not normalized and can take negative values, and a 
real-world image and a respective noisy version. 

The intrinsic relationship between similarity indices 
and statistical quantifications of joint variation between 
random variables was approached subsequently, and it has 
been argued that not only the Pearson correlation 
coefficient can be used to compare two density functions, 
but also a respective adaptation of the Jaccard and 
coincidence indices can be used for that purpose. We also 
discussed the interesting possibility to visualize the 
action of the Jaccard and coincidence indices with respect 
to the division of the data into two regions defined by the 
identity line in the scatterplot distribution. 

The situation of similarity and other indices 
considering three or more sets was then discussed, 
identifying the possibility to consider the two sets 
involved in the basic Jaccard and coincidence indices as 
corresponding to the result of set operation combinations 
between any number of other sets. Another important 
extension was considered with respect to taking into 
account more than 2 sets as arguments for the similarity 
indices, which was illustrated in terms of a suggested 
index to quantify the chaining between three sets. 

Several further works are motivated by the concepts and 
methods reported and suggested in this work, a complete 
list of which would be particularly extensive. Some of 
the possibilities include comparing the described indices 
with other indicators of similarity, identifying other 
types of relationships that can be quantified when 
considering 3 or more sets, developing analogous 
generalizations of other interesting indices, and 
extending the described indices to other mathematical 
structures. In addition, as observed in Section 10, 
similarity and other indices such as those addressed here 
provide valuable means for developing and evaluating 
models of data as well as for several pattern recognition 
and deep learning tasks. 